ExPairT-LLM: Exact Learning for LLM Code Selection by Pairwise Queries
Yuviler, Tom, Drachsler-Cohen, Dana
Despite recent advances in LLMs, the task of code generation is still challenging. To cope, code selection algorithms select the best program from multiple programs generated by an LLM. However, existing algorithms can fail to identify the correct program, either because they fail to distinguish nonequivalent programs or because they rely on an LLM and assume it always correctly determines the output for every input. We present ExPairT-LLM, an exact learning algorithm for code selection that selects a program by posing two new types of queries to an LLM oracle: pairwise membership and pairwise equivalence. These queries are simpler for LLMs and enable ExPairT-LLM to identify the correct program through a tournament, which is robust to some LLM mistakes. We evaluate ExPairT-LLM on four popular code datasets. Its pass@1 (success rate) outperforms the state-of-the-art code selection algorithm on average by +13.0% and up to +27.1%. It also improves the pass@1 of LLMs performing complex reasoning by +24.0%.
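The tournament idea at the heart of the abstract can be sketched as follows. This is a simplified stand-in, not the authors' algorithm: a generic pairwise oracle `prefer(a, b)` (a hypothetical placeholder for ExPairT-LLM's pairwise membership/equivalence queries to an LLM) picks the more-likely-correct of two candidates, and a single-elimination tournament reduces the pool to one winner, tolerating some oracle mistakes on early-round losers.

```python
# Single-elimination tournament over candidate programs using a
# pairwise oracle. `prefer(a, b)` returns whichever of the two
# candidates it judges more likely to be correct. This is an
# illustrative stand-in for ExPairT-LLM's pairwise LLM queries.

def tournament_select(candidates, prefer):
    """Return a single winner by repeated pairwise comparison."""
    round_ = list(candidates)
    while len(round_) > 1:
        next_round = []
        # Compare candidates in pairs; an odd one out advances for free.
        for i in range(0, len(round_) - 1, 2):
            next_round.append(prefer(round_[i], round_[i + 1]))
        if len(round_) % 2 == 1:
            next_round.append(round_[-1])
        round_ = next_round
    return round_[0]

# Toy oracle: prefer the program whose output on a probe input
# matches a reference answer (stands in for an LLM judgment).
def make_oracle(probe, expected):
    def prefer(a, b):
        try:
            ok_a = a(probe) == expected
        except Exception:
            ok_a = False
        return a if ok_a else b
    return prefer
```

With this structure, a correct program only needs to win its own matches to surface, rather than requiring the oracle to be right on every possible comparison.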
Reviews: Sampling for Bayesian Program Learning
I found this paper interesting and well-written, but I have some significant questions and comments about the approach. The paper argues that sampling is useful because we can find the C most frequently sampled programs and show them to a user. As shown in Figure 6, there is more likely to be a correct program in the top 3 programs than in the top 1. But if we want to show the top C programs, do we really need to perform sampling, which the paper says is complicated by the existence of many long and unlikely programs that match the training examples? Why can't we simply find the MDL program and then run the solver again with length restrictions to find other consistent programs of the same length, or slightly longer lengths?
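The alternative the review proposes (find the MDL program, then search again with a length budget to collect other consistent programs of equal or slightly greater size) can be illustrated on a toy expression language. The grammar, size measure, and `slack` parameter below are invented for illustration; a real system would use a constraint solver rather than brute-force enumeration.

```python
from itertools import product

# Toy DSL: expressions over a variable x, small constants, + and *.
# Enumerate expressions in order of size and keep those matching all
# training examples: "MDL program first, then slightly longer
# consistent programs", with no sampling involved.

LEAVES = ["x", "1", "2", "3"]
OPS = ["+", "*"]

def exprs_of_size(n):
    """All expressions whose parse tree has exactly n leaves."""
    if n == 1:
        yield from LEAVES
        return
    for k in range(1, n):
        for left, op, right in product(exprs_of_size(k), OPS,
                                       exprs_of_size(n - k)):
            yield f"({left} {op} {right})"

def consistent(expr, examples):
    return all(eval(expr, {"x": xi}) == yi for xi, yi in examples)

def top_programs(examples, slack=1, limit=5):
    """Shortest consistent programs, plus those up to `slack` leaves longer."""
    found, mdl_size = [], None
    for n in range(1, 8):
        if mdl_size is not None and n > mdl_size + slack:
            break
        for e in exprs_of_size(n):
            if consistent(e, examples):
                mdl_size = mdl_size or n
                found.append(e)
                if len(found) >= limit:
                    return found
    return found
```

For examples like f(1)=2, f(2)=4, f(3)=6, this returns `(x + x)` and a few equal-length or slightly longer variants, which is exactly the "top C programs" list the review asks about, obtained without sampling.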
Top Pass: Improve Code Generation by Pass@k-Maximized Code Ranking
Lyu, Zhi-Cun, Li, Xin-Ye, Xie, Zheng, Li, Ming
Code generation has been greatly enhanced by recent profound advances in Large Language Models (LLMs). Nevertheless, such LLM-based code generation approaches still struggle to generate error-free code in a few tries when faced with complex problems. To address this, the prevailing strategy is to sample a huge number of candidate programs in the hope that any one of them works. However, users of code generation systems usually expect to find a correct program by reviewing or testing only a small number of candidates; otherwise, the system is unhelpful. In this paper, we propose Top Pass, a code ranking approach that identifies potentially correct solutions from a large number of candidates. Top Pass directly optimizes the pass@k loss function, enhancing the quality at the top of the candidate list. This enables the user to find the correct solution within as few tries as possible. Experimental results on four benchmarks indicate that our Top Pass method enhances the usability of code generation models by producing better ranking results, in particular achieving a 32.9% relative improvement in pass@1 on CodeContests compared to the state-of-the-art ranking method.
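The central idea of optimizing ranking quality at the top of the list so that pass@k improves can be illustrated with a hinge-style surrogate. This surrogate is an illustration under simplifying assumptions, not the paper's exact pass@k loss formulation.

```python
import numpy as np

# Hinge-style surrogate for a pass@k-oriented ranking objective:
# for each problem, at least one correct candidate should outscore
# every incorrect candidate by a margin, so that a correct program
# lands near the top of the ranked list.

def passk_surrogate_loss(scores, labels, margin=1.0):
    """scores: (n,) ranker scores; labels: (n,) with 1 = passes tests."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0
    best_pos = pos.max()
    # Penalize every incorrect candidate that comes within `margin`
    # of (or above) the best correct candidate.
    return float(np.maximum(0.0, margin - (best_pos - neg)).mean())

def pass_at_k(scores, labels, k):
    """1.0 if any of the top-k ranked candidates is correct."""
    order = np.argsort(scores)[::-1][:k]
    return float(np.asarray(labels)[order].max())
```

Driving the surrogate down pushes some correct candidate above the incorrect ones, which is precisely what raises pass@1 for a fixed candidate pool.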
Refactoring Programs Using Large Language Models with Few-Shot Examples
Shirafuji, Atsushi, Oda, Yusuke, Suzuki, Jun, Morishita, Makoto, Watanobe, Yutaka
Less complex, more straightforward programs are easier to maintain and make it easier to write secure, bug-free code. However, due to the heavy workload involved and the risk of breaking working programs, programmers are reluctant to refactor their code, which also costs them potential learning experiences. To mitigate this, we demonstrate the use of a large language model (LLM), GPT-3.5, to suggest less complex versions of user-written Python programs, aiming to encourage users to learn how to write better programs. We propose a method for few-shot prompting of the LLM that selects the best-suited code refactoring examples for each target programming problem, based on a prior evaluation of prompting with a one-shot example. The quantitative evaluation shows that 95.68% of programs can be refactored by generating 10 candidates each, resulting in a 17.35% reduction in average cyclomatic complexity and a 25.84% decrease in the average number of lines after filtering out generated programs that are not semantically correct. Furthermore, the qualitative evaluation shows outstanding capability in code formatting, although unnecessary behaviors such as deleting or translating comments are also observed.
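The filter-and-measure step described in the abstract (keep only candidates that remain semantically correct, then compare complexity) can be sketched with a rough AST-based cyclomatic-complexity count. The `1 + branching nodes` approximation below is a simplification of the standard metric, and the I/O-test filter is an assumption about how "semantically correct" is checked, not the paper's exact procedure.

```python
import ast

# Rough cyclomatic complexity: 1 + number of branching constructs.
# A simplification of the standard metric, for illustration only.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.BoolOp, ast.IfExp, ast.comprehension)

def cyclomatic_complexity(source):
    tree = ast.parse(source)
    return 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(tree))

def keep_correct(candidates, func_name, tests):
    """Filter refactoring candidates that still pass the I/O tests."""
    kept = []
    for src in candidates:
        env = {}
        try:
            exec(src, env)
            if all(env[func_name](x) == y for x, y in tests):
                kept.append(src)
        except Exception:
            pass  # discard candidates that crash or fail a test
    return kept
```

Applying `cyclomatic_complexity` before and after refactoring on the surviving candidates yields exactly the kind of reduction statistic the abstract reports.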
Program Repair with Minimal Edits Using CodeT5
Shirafuji, Atsushi, Rahman, Md. Mostafizer, Amin, Md Faizul Ibne, Watanobe, Yutaka
Programmers often struggle to identify and fix bugs in their programs. In recent years, many language models (LMs) have been proposed to fix erroneous programs and support error recovery. However, LMs tend to generate solutions that differ from the original input programs, which leads to potential comprehension difficulties for users. In this paper, we propose an approach to suggest a correct program with minimal repair edits using CodeT5. We fine-tune a pre-trained CodeT5 on code pairs of wrong and correct programs and evaluate its performance against several baseline models. The experimental results show that the fine-tuned CodeT5 achieves a pass@100 of 91.95% and an average edit distance to the most similar correct program of 6.84, which indicates that at least one correct program can be suggested by generating 100 candidate programs. We demonstrate the effectiveness of LMs in suggesting program repairs with minimal edits for introductory programming problems.
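The "minimal edits" criterion, i.e. among candidate repairs that pass, prefer the one closest to the user's original program, can be sketched with a plain character-level Levenshtein distance. The fine-tuned CodeT5 model itself is not reproduced here; `is_correct` is a hypothetical callback standing in for running the candidate against the problem's tests.

```python
def levenshtein(a, b):
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closest_correct(original, candidates, is_correct):
    """Among passing candidates, pick the one closest to the original."""
    passing = [c for c in candidates if is_correct(c)]
    return min(passing, key=lambda c: levenshtein(original, c),
               default=None)
```

Averaging `levenshtein(original, best)` over a benchmark gives the "average edit distance to the most similar correct program" figure the abstract reports.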
LEVER: Learning to Verify Language-to-Code Generation with Execution
Ni, Ansong, Iyer, Srini, Radev, Dragomir, Stoyanov, Ves, Yih, Wen-tau, Wang, Sida I., Lin, Xi Victoria
The advent of large language models trained on code (code LLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine LLM decoding with sample pruning and reranking using test cases or heuristics based on the execution results. However, it is challenging to obtain test cases for many real-world language-to-code applications, and heuristics cannot well capture the semantic features of the execution results, such as data type and value range, which often indicate the correctness of the program. In this work, we propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLM is correct based on the natural language input, the program itself, and its execution results. The sampled programs are reranked by combining the verification score with the LLM generation probability and marginalizing over programs with the same execution results. On four datasets across the domains of table QA, math QA, and basic Python programming, LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.
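The reranking step described above, combining the LLM's generation probability with a verifier score and then marginalizing over programs that produce the same execution result, can be sketched as follows. The numeric scores here are placeholders standing in for the LLM and the trained verifier, not LEVER's actual models.

```python
from collections import defaultdict

# LEVER-style reranking sketch: each candidate program carries its
# generation probability and a verifier score; the combined score is
# summed (marginalized) over programs whose execution results agree,
# and the execution result with the highest total mass wins.

def rerank(candidates):
    """candidates: list of (program, gen_prob, verifier_prob, exec_result).
    Returns (best program, best execution result)."""
    mass = defaultdict(float)   # total score per execution result
    rep = {}                    # best-scoring program per result
    for prog, gen_p, ver_p, result in candidates:
        score = gen_p * ver_p           # combined per-program score
        mass[result] += score           # marginalize over same result
        if result not in rep or score > rep[result][1]:
            rep[result] = (prog, score)
    best_result = max(mass, key=mass.get)
    return rep[best_result][0], best_result
```

Marginalization is what lets several low-probability programs that agree on the answer outvote a single high-probability program whose result the verifier distrusts.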